Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow signal bootstrap sends to commands on cancel to be customized #1390

Merged
merged 2 commits into from
Mar 16, 2021
Merged

Allow signal bootstrap sends to commands on cancel to be customized #1390

merged 2 commits into from
Mar 16, 2021

Conversation

brentleyjones
Copy link
Contributor

Currently bootstrap sends a SIGTERM to commands when it, itself, is interrupted.

This adds cancel-signal which accepts signals in the form of SIGTERM, SIGHUP, etc.

This is the same config that start accepts, and is based on the PR that added that feature: #1041

Finally, start passes its cancel-signal to bootstrap, to prevent having to customize the bootstrap-script.

Currently bootstrap sends a `SIGTERM` to commands when it, itself, is interrupted.

This adds `cancel-signal` which accepts signals in the form of SIGTERM, SIGHUP, etc.

This is the same config that `start` accepts, and is based on the PR that added that feature: #1041

Finally, `start` passes its `cancel-signal` to `bootstrap`, to prevent having to customize the `bootstrap-script`.
@sj26 sj26 requested a review from pda March 3, 2021 22:24
@chloeruka chloeruka self-assigned this Mar 4, 2021
@chloeruka chloeruka self-requested a review March 4, 2021 03:43
Copy link
Contributor

@chloeruka chloeruka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution @brentleyjones 🙏🏻

This is a straightforward implementation and safe improvement to current behaviour. It might be worth adding a test around to confirm the configuration signal reaches the shell configuration, but otherwise these are all well tested corners of the code.

We'll carry out our intention to transition the defaults to SIGINT in the future, but for now, it makes sense to add the configuration option and allow us to experiment with the finer details of how the change will affect users. I'm very appreciative of your well-reasoned need for using this to safely dispose of process groups, this is helpful for communicating why we will make this change in the next major release.

@pda
Copy link
Member

pda commented Mar 5, 2021

Excuse me while I think this through “out loud”…

The existing buildkite-agent start command has --cancel-signal=SIGTERM which can also be set via BUILDKITE_CANCEL_SIGNAL environment instead. This propagates into AgentWorker → JobRunner → process (where it's called InterruptSignal) and is used when process.Interrupt is called by JobRunner.Cancel. It sends the configured signal to the process group of the bootstrap process. My limited mental model of process groups is that generally all processes in the group would receive the signal — i.e. the bootstrap and all the commands it's running. Or is it only the process leader? Or “it depends”?

JobRunner TL;DR: on job cancellation, the job runner sends the bootstrap process (and its child processes?) the configured signal.

This PR introduces to buildkite-agent bootstrap a new --cancel-signal=SIGTERM which can also be set via BUILDKITE_CANCEL_SIGNAL environment when the bootstrap is started. This PR changes buildkite-agent start to pass its configured cancel signal through to the boostrap. The bootstrap uses this new config to set b.shell.InterruptSignal, which passes it to shell's process.InterruptSignal, which is used during Process.Interrupt, which is called by Shell.Interrupt. That's called when the cancelCh channel is written to (which logs “Received cancellation signal, interrupting”). That channel is written to when Bootstrap.Cancel() is called, which happens in the bootstrap signal handler. That signal handler listens to SIGHUP, SIGTERM, SIGINT, SIGQUIT.

Bootstrap TL;DR: bootstrap signal handler calls bootstrap.Cancel()b.shell.Interrupt()Process.Interrupt() which uses this new bootstrap --cancel-signal / BUILDKITE_CANCEL_SIGNAL instead of being hard-coded to SIGTERM.

So I think the intention of this PR is to make buildkite-agent start --cancel-signal=SIGINT actually send SIGINT to the build commands being run within the bootstrap. That sounds like it was the intention of the original option, and the bootstrap layer was unintentionally getting in the way. This PR does that by adding the same --cancel-signal option to bootstrap and passing the option through.

So … 👍🏼

My only feedback is that I think we should pass BUILDKITE_CANCEL_SIGNAL from JobRunner to Bootstrap via environment (which might already be happening implicitly?) rather than injecting --cancel-signal into the bootstrap command ARGV. Thoughts?

"%s bootstrap --cancel-signal %s",
shellwords.Quote(os.Args[0]),
cfg.CancelSignal,
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the part that I think should be propagated via environment rather than CLI args. I think we have other options passing through via environment, perhaps most recently added by @ticky?

@pda
Copy link
Member

pda commented Mar 5, 2021

Something else we should think about is either test coverage, or at least a before/after proof copied/pasted into this comment thread showing that this results in the configured signal being sent to the actual commands being run within the bootstrap.

@brentleyjones
Copy link
Contributor Author

My only feedback is that I think we should pass BUILDKITE_CANCEL_SIGNAL from JobRunner to Bootstrap via environment (which might already be happening implicitly?) rather than injecting --cancel-signal into the bootstrap command ARGV. Thoughts?

That was what I tried first, but it wasn't being passed through. I would be fine with it working that way, the direct argument call was the easiest for me to get it to work.

@brentleyjones
Copy link
Contributor Author

Something else we should think about is either test coverage, or at least a before/after proof copied/pasted into this comment thread showing that this results in the configured signal being sent to the actual commands being run within the bootstrap.

I wasn't sure how to do the tests, but I would be supportive of that. I can instead create a little example using traps that shows which signals are being sent (it was how I noticed the issue to begin with). Just lmk.

@pda
Copy link
Member

pda commented Mar 16, 2021

Sorry for the silence @brentleyjones

I've pushed a commit to this PR which propagates the option to bootstrap via BUILDKITE_CANCEL_SIGNAL env rather than an argv flag; there's precedent for this in JobRunner.createEnvironment().

I was also unsure how to write tests for this, so I did some local testing. I built a simple binary that reports which signal it was killed by. I ran buildkite-agent start --cancel-signal SIGINT, and cancelled a build running on the main branch (top) and on this PR branch (bottom):

image

got signal: interrupt looks good 👍🏼

@yob yob merged commit c089744 into buildkite:main Mar 16, 2021
@yob
Copy link
Contributor

yob commented Mar 16, 2021

thanks @brentleyjones ❤️

@brentleyjones brentleyjones deleted the bj/allow-signal-bootstrap-sends-to-commands-on-cancel-to-be-customized branch March 16, 2021 12:40
@brentleyjones
Copy link
Contributor Author

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants